For this assignment we're using TensorFlow (again!) to build a fake news classifier. As always, we start with all the imports we'll be using throughout the project.

```python
#imports
import os
import tensorflow as tf
from tensorflow import keras, strings
from tensorflow.keras import utils
from tensorflow.keras import layers, losses
import pandas as pd
import re
import string
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
from plotly import express as px
import plotly.io as pio

pio.renderers.default = 'colab'
tf.config.run_functions_eagerly(True)
```
Data Preparation
We'll be using a dataset of news articles that includes each article's title, its text content, and a label indicating whether or not it contains fake news. The data at the URL below is training data only; we'll import the testing data once we have a decent working model.
```python
#import training data
train_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_train.csv?raw=true"
df = pd.read_csv(train_url)
df.head()
```
|   | Unnamed: 0 | title | text | fake |
|---|---|---|---|---|
| 0 | 17366 | Merkel: Strong result for Austria's FPO 'big c... | German Chancellor Angela Merkel said on Monday... | 0 |
| 1 | 5634 | Trump says Pence will lead voter fraud panel | WEST PALM BEACH, Fla.President Donald Trump sa... | 0 |
| 2 | 17487 | JUST IN: SUSPECTED LEAKER and “Close Confidant... | On December 5, 2017, Circa s Sara Carter warne... | 1 |
| 3 | 12217 | Thyssenkrupp has offered help to Argentina ove... | Germany s Thyssenkrupp, has offered assistance... | 0 |
| 4 | 5535 | Trump say appeals court decision on travel ban... | President Donald Trump on Thursday called the ... | 0 |
In natural language processing it's common to remove stopwords, a set of commonly used neutral words in the language, from input data. This lets our models spend more energy analyzing meaningful words. We can download a set of English stopwords from the `nltk` package.

Here, we implement this preprocessing as part of `make_dataset()`, which deletes stopwords from a given dataframe and turns it into a TensorFlow `Dataset` object via the `from_tensor_slices()` function. To allow multiple inputs to be processed later, we pass a tuple of dictionaries, where the first dictionary holds the title and text inputs and the second holds the label: whether or not the corresponding article contains fake news.

Finally, we batch the dataset for increased training speed before returning it.
```python
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

def make_dataset(df):
    df = df.copy()
    #compare lowercased words against the (lowercase) stopword list so
    #capitalized stopwords are dropped too
    drop_stopwords = lambda words: ' '.join(word.lower() for word in words.split() if word.lower() not in stop)

    #drop stopwords from title and text columns
    df["title"] = df["title"].apply(drop_stopwords)
    df["text"] = df["text"].apply(drop_stopwords)

    #create and batch dataset from dataframe
    data_set = tf.data.Dataset.from_tensor_slices(({"title": df[["title"]], "text": df[["text"]]},
                                                   {"fake": df[["fake"]]}))
    data_set = data_set.batch(100)
    return data_set
```
```
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
```
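As a quick sanity check, here's the stopword filter applied to a made-up sentence (the example is mine, not from the dataset):

```python
# Hypothetical example: stopwords like "this", "is", "a", and "the" are
# dropped, leaving only the content words.
sample = "This is a story about the election"
print(' '.join(word.lower() for word in sample.split() if word.lower() not in stop))
# story election
```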
```python
data = make_dataset(df)
```
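If you want to peek at the structure (an optional check I'm adding here), `element_spec` shows the (features, labels) pair of dictionaries we built:

```python
# Optional peek: each batch is a tuple of two dicts, mirroring the structure
# passed to from_tensor_slices() above.
print(data.element_spec)
```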
Next, we split 20% of our data to be used as validation data, and use the remainder to train our model.
```python
#separate 20% of data as validation data
val_size = int(0.2*len(data))

val = data.take(val_size)
train = data.skip(val_size)
```
Let's take a look at the distribution of fake news in our dataset to determine the base rate. As in Homework 4, we construct an iterator that extracts the "fake" column of our dataset, then use some list operations to count the fake and true articles.
```python
labels_iterator = train.unbatch().map(lambda data_dict, fake: fake["fake"]).as_numpy_iterator()
```
```python
#count fake and true news articles in dataset
labels = [i for i in labels_iterator]

num_fake = sum(labels)
num_true = len(labels) - num_fake
print(num_fake, num_true)
```
```
[9412] [8537]
```
There are 9412 articles containing fake news and 8537 without, so the base model will guess that a given news article is fake with probability \(\frac{9412}{9412+8537} \approx 52.4\%\), which is the base rate.
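Spelling that arithmetic out in code:

```python
# Base rate: always guessing "fake" is right about 52.4% of the time.
base_rate = 9412 / (9412 + 8537)
print(f"{base_rate:.1%}")  # 52.4%
```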
Creating Models
Before we assemble models using the TensorFlow Functional API, we need to prepare some layers. Most neural network layers can't process text directly, so we first convert text to integer sequences that can be interpreted and analyzed. The `TextVectorization` layer does this for us. For this model we use a `size_vocabulary` of 2000 (telling the layer to track only the 2000 most frequent distinct words) and an `output_sequence_length` of 500. We also pass a `standardize` argument, our `standardization()` function, which lowercases all text and strips punctuation.

We make two layers: one for interpreting article titles and one for interpreting article contents. Once created, each layer can be adapted to its respective data input.
```python
#prepare TextVectorization layers for both title and text inputs
size_vocabulary = 2000

#more preprocessing: lowercase and strip punctuation
def standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    no_punctuation = tf.strings.regex_replace(lowercase,
                                              '[%s]' % re.escape(string.punctuation), '')
    return no_punctuation

title_vectorize_layer = layers.TextVectorization(
    standardize=standardization,
    max_tokens=size_vocabulary, # only consider this many words
    output_mode='int',
    output_sequence_length=500)

#adapt layer to title vocabulary
title_vectorize_layer.adapt(train.map(lambda x, y: x["title"]))

text_vectorize_layer = layers.TextVectorization(
    standardize=standardization,
    max_tokens=size_vocabulary, # only consider this many words
    output_mode='int',
    output_sequence_length=500)

#adapt layer to text vocabulary
text_vectorize_layer.adapt(train.map(lambda x, y: x["text"]))
```
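To see what the layer actually produces (the sample title below is invented for illustration), we can pass a string through it:

```python
# Hypothetical spot check: each word becomes an integer index into the
# adapted vocabulary, padded out to the 500-token sequence length.
sample_title = tf.constant([["breaking news trump obama"]])
print(title_vectorize_layer(sample_title).shape)  # (1, 500)
print(title_vectorize_layer(sample_title)[0, :6])
```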
In Homework 4, we used TensorFlow's Sequential API to construct neural networks. Its limitation is that the sequence of layers is strictly linear. In reality, neural networks are often closer to a directed graph, where layers can branch off, be reused, and accept multiple inputs. The Functional API better captures this.

To create a model using the Functional API, we start by defining `Input` objects. In particular, these take a `name` parameter that corresponds to the keys of the dictionaries in our dataset.
```python
#create inputs
title_inputs = keras.Input(shape=(1,), name="title", dtype="string")
text_inputs = keras.Input(shape=(1,), name="text", dtype="string")
```
Our first model will only consider article titles. Along with `Dropout()`, which we used in Homework 4, and the `title_vectorize_layer` we defined earlier, we introduce an `Embedding()` layer, which takes the integer sequences output by the vectorization layer and turns them into dense vectors, as well as a `GlobalAveragePooling1D()` layer, which averages data across one dimension. We complete the model by assigning the output of a final `Dense()` layer to `title_outputs`, and then constructing `title_model`, a `keras.Model` object.
```python
#model that takes in title only
x = title_vectorize_layer(title_inputs)
x = layers.Embedding(size_vocabulary, output_dim=3, name="embedding")(x)
x = layers.Dropout(0.1)(x)
x = layers.GlobalAveragePooling1D()(x)
title_outputs = layers.Dense(2, activation='softmax', name="fake")(x)

title_model = keras.Model(inputs=title_inputs, outputs=title_outputs)
```
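As a toy illustration of the pooling step (the tensor below is made up), `GlobalAveragePooling1D()` collapses the sequence axis by averaging:

```python
# Toy example: (batch, steps, channels) -> (batch, channels) by averaging
# over the steps axis.
toy = tf.constant([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]])  # shape (1, 3, 2)
print(layers.GlobalAveragePooling1D()(toy))  # [[3. 4.]]
```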
Here’s what our model looks like:
```python
title_model.summary()
```
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
title (InputLayer) [(None, 1)] 0
text_vectorization (TextVec (None, 500) 0
torization)
embedding (Embedding) (None, 500, 3) 6000
dropout (Dropout) (None, 500, 3) 0
global_average_pooling1d (G (None, 3) 0
lobalAveragePooling1D)
fake (Dense) (None, 2) 8
=================================================================
Total params: 6,008
Trainable params: 6,008
Non-trainable params: 0
_________________________________________________________________
In fact, we can visualize this as a directed graph using `utils.plot_model()`. Right now our model is entirely linear, though, so it won't look very interesting. (The parameter counts check out, by the way: the embedding layer holds 2000 × 3 = 6000 weights, and the final dense layer has 3 × 2 weights plus 2 biases, for 8.)
```python
utils.plot_model(title_model)
```
Now we can compile and train our model just like in the previous homework.
```python
#compile model
sce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

title_model.compile(optimizer='adam',
                    loss=sce,
                    metrics=['accuracy'])
```
```python
#train model on training data
title_history = title_model.fit(train, epochs=20, validation_data=val)
```
```
Epoch 1/20
180/180 [==============================] - 11s 40ms/step - loss: 0.6913 - accuracy: 0.5210 - val_loss: 0.6904 - val_accuracy: 0.5173
Epoch 2/20
180/180 [==============================] - 7s 38ms/step - loss: 0.6879 - accuracy: 0.5244 - val_loss: 0.6862 - val_accuracy: 0.5173
Epoch 3/20
180/180 [==============================] - 7s 40ms/step - loss: 0.6821 - accuracy: 0.5253 - val_loss: 0.6790 - val_accuracy: 0.5173
Epoch 4/20
180/180 [==============================] - 7s 38ms/step - loss: 0.6732 - accuracy: 0.5725 - val_loss: 0.6689 - val_accuracy: 0.5178
Epoch 5/20
180/180 [==============================] - 7s 41ms/step - loss: 0.6615 - accuracy: 0.6629 - val_loss: 0.6559 - val_accuracy: 0.5696
Epoch 6/20
180/180 [==============================] - 8s 44ms/step - loss: 0.6471 - accuracy: 0.7759 - val_loss: 0.6406 - val_accuracy: 0.6947
Epoch 7/20
180/180 [==============================] - 7s 40ms/step - loss: 0.6305 - accuracy: 0.8585 - val_loss: 0.6233 - val_accuracy: 0.8051
Epoch 8/20
180/180 [==============================] - 7s 38ms/step - loss: 0.6123 - accuracy: 0.9032 - val_loss: 0.6044 - val_accuracy: 0.8738
Epoch 9/20
180/180 [==============================] - 7s 41ms/step - loss: 0.5927 - accuracy: 0.9223 - val_loss: 0.5845 - val_accuracy: 0.9031
Epoch 10/20
180/180 [==============================] - 7s 40ms/step - loss: 0.5724 - accuracy: 0.9339 - val_loss: 0.5639 - val_accuracy: 0.9233
Epoch 11/20
180/180 [==============================] - 7s 38ms/step - loss: 0.5517 - accuracy: 0.9381 - val_loss: 0.5431 - val_accuracy: 0.9322
Epoch 12/20
180/180 [==============================] - 7s 41ms/step - loss: 0.5306 - accuracy: 0.9402 - val_loss: 0.5221 - val_accuracy: 0.9400
Epoch 13/20
180/180 [==============================] - 8s 42ms/step - loss: 0.5095 - accuracy: 0.9438 - val_loss: 0.5013 - val_accuracy: 0.9416
Epoch 14/20
180/180 [==============================] - 7s 37ms/step - loss: 0.4889 - accuracy: 0.9433 - val_loss: 0.4810 - val_accuracy: 0.9447
Epoch 15/20
180/180 [==============================] - 7s 40ms/step - loss: 0.4690 - accuracy: 0.9436 - val_loss: 0.4612 - val_accuracy: 0.9440
Epoch 16/20
180/180 [==============================] - 7s 41ms/step - loss: 0.4491 - accuracy: 0.9449 - val_loss: 0.4420 - val_accuracy: 0.9433
Epoch 17/20
180/180 [==============================] - 7s 37ms/step - loss: 0.4305 - accuracy: 0.9443 - val_loss: 0.4236 - val_accuracy: 0.9460
Epoch 18/20
180/180 [==============================] - 7s 40ms/step - loss: 0.4126 - accuracy: 0.9454 - val_loss: 0.4060 - val_accuracy: 0.9473
Epoch 19/20
180/180 [==============================] - 7s 41ms/step - loss: 0.3953 - accuracy: 0.9465 - val_loss: 0.3892 - val_accuracy: 0.9484
Epoch 20/20
180/180 [==============================] - 7s 38ms/step - loss: 0.3787 - accuracy: 0.9466 - val_loss: 0.3731 - val_accuracy: 0.9493
```
That's a pretty impressive accuracy, and there doesn't seem to be any overfitting! Let's borrow our `plot_accuracies()` function from last time to visualize this.
```python
def plot_accuracies(history, name):
    #plot validation accuracies of model with name
    accuracies = history.history["val_accuracy"]
    fig, ax = plt.subplots()
    ax.plot(range(1, len(accuracies)+1), accuracies)

    ax.set_title(name + " Fitting History")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Validation Accuracy")
    ax.set_xticks(range(1, len(accuracies)+1, 2))

    return fig
```
```python
title_plot = plot_accuracies(title_history, "Title Model")
```
So it seems like the title alone is a pretty good indicator of whether or not an article contains fake news, but can we do better? What if we considered the article's contents instead? Our next model is the same as the previous one, except that it takes in `text_inputs` and uses the `text_vectorize_layer` instead.
```python
#same process for text input only
#create model
x = text_vectorize_layer(text_inputs)
x = layers.Embedding(size_vocabulary, output_dim=3, name="embedding")(x)
x = layers.Dropout(0.1)(x)
x = layers.GlobalAveragePooling1D()(x)
text_outputs = layers.Dense(2, activation='softmax', name="fake")(x)

text_model = keras.Model(inputs=text_inputs, outputs=text_outputs)

#compile model
sce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

text_model.compile(optimizer='adam',
                   loss=sce,
                   metrics=['accuracy'])

#train model
text_history = text_model.fit(train, epochs=20, validation_data=val)
```
```
Epoch 1/20
180/180 [==============================] - 8s 47ms/step - loss: 0.6820 - accuracy: 0.5343 - val_loss: 0.6686 - val_accuracy: 0.5596
Epoch 2/20
180/180 [==============================] - 8s 47ms/step - loss: 0.6467 - accuracy: 0.6862 - val_loss: 0.6231 - val_accuracy: 0.7462
Epoch 3/20
180/180 [==============================] - 8s 45ms/step - loss: 0.5938 - accuracy: 0.8483 - val_loss: 0.5651 - val_accuracy: 0.8567
Epoch 4/20
180/180 [==============================] - 8s 43ms/step - loss: 0.5339 - accuracy: 0.9059 - val_loss: 0.5058 - val_accuracy: 0.9027
Epoch 5/20
180/180 [==============================] - 8s 47ms/step - loss: 0.4765 - accuracy: 0.9232 - val_loss: 0.4524 - val_accuracy: 0.9184
Epoch 6/20
180/180 [==============================] - 9s 47ms/step - loss: 0.4265 - accuracy: 0.9292 - val_loss: 0.4074 - val_accuracy: 0.9251
Epoch 7/20
180/180 [==============================] - 8s 43ms/step - loss: 0.3849 - accuracy: 0.9339 - val_loss: 0.3703 - val_accuracy: 0.9331
Epoch 8/20
180/180 [==============================] - 8s 46ms/step - loss: 0.3505 - accuracy: 0.9372 - val_loss: 0.3399 - val_accuracy: 0.9389
Epoch 9/20
180/180 [==============================] - 9s 47ms/step - loss: 0.3221 - accuracy: 0.9412 - val_loss: 0.3148 - val_accuracy: 0.9409
Epoch 10/20
180/180 [==============================] - 8s 44ms/step - loss: 0.2981 - accuracy: 0.9435 - val_loss: 0.2937 - val_accuracy: 0.9438
Epoch 11/20
180/180 [==============================] - 8s 46ms/step - loss: 0.2778 - accuracy: 0.9475 - val_loss: 0.2758 - val_accuracy: 0.9460
Epoch 12/20
180/180 [==============================] - 8s 46ms/step - loss: 0.2602 - accuracy: 0.9506 - val_loss: 0.2602 - val_accuracy: 0.9476
Epoch 13/20
180/180 [==============================] - 8s 43ms/step - loss: 0.2453 - accuracy: 0.9535 - val_loss: 0.2468 - val_accuracy: 0.9493
Epoch 14/20
180/180 [==============================] - 8s 46ms/step - loss: 0.2318 - accuracy: 0.9553 - val_loss: 0.2349 - val_accuracy: 0.9507
Epoch 15/20
180/180 [==============================] - 8s 47ms/step - loss: 0.2197 - accuracy: 0.9570 - val_loss: 0.2244 - val_accuracy: 0.9522
Epoch 16/20
180/180 [==============================] - 8s 46ms/step - loss: 0.2094 - accuracy: 0.9584 - val_loss: 0.2150 - val_accuracy: 0.9524
Epoch 17/20
180/180 [==============================] - 8s 45ms/step - loss: 0.1993 - accuracy: 0.9611 - val_loss: 0.2065 - val_accuracy: 0.9547
Epoch 18/20
180/180 [==============================] - 8s 43ms/step - loss: 0.1903 - accuracy: 0.9623 - val_loss: 0.1988 - val_accuracy: 0.9564
Epoch 19/20
180/180 [==============================] - 8s 46ms/step - loss: 0.1820 - accuracy: 0.9641 - val_loss: 0.1918 - val_accuracy: 0.9571
Epoch 20/20
180/180 [==============================] - 8s 47ms/step - loss: 0.1748 - accuracy: 0.9650 - val_loss: 0.1854 - val_accuracy: 0.9589
```
```python
text_plot = plot_accuracies(text_history, "Text Model")
```
This is a bit better than before, but there's still room for improvement. You'll notice that although the Functional API allows for non-sequential models, we haven't actually put this to use yet. So next, let's build a model that takes in both inputs. We'll process each the same way as before up through `GlobalAveragePooling1D()`, then concatenate the two branches before applying a final `Dense()` layer. We can use `utils.plot_model()` again to visualize the model.
```python
#create model that takes in both title and text as inputs
#two separate embedding layers for later plotting
title_embedding = layers.Embedding(size_vocabulary, output_dim=10, name="titleEmbedding")
text_embedding = layers.Embedding(size_vocabulary, output_dim=10, name="textEmbedding")

x = title_vectorize_layer(title_inputs)
x = title_embedding(x)
x = layers.Dropout(0.1)(x)
x = layers.GlobalAveragePooling1D()(x)

y = text_vectorize_layer(text_inputs)
y = text_embedding(y)
y = layers.Dropout(0.1)(y)
y = layers.GlobalAveragePooling1D()(y)

#combine branches of model into one output
main = layers.concatenate([x, y], axis=1)
outputs = layers.Dense(2, activation='softmax', name="fake")(main)

model = keras.Model(inputs=[title_inputs, text_inputs], outputs=outputs)

#visualize model
utils.plot_model(model)
```
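A quick illustration of the concatenation step (shapes only, with made-up tensors): the two pooled branches are joined feature-wise, so the final dense layer sees both at once.

```python
# Two (batch, 10) branches concatenated along axis 1 give one (batch, 20)
# tensor for the final dense layer.
a = tf.ones((1, 10))
b = tf.zeros((1, 10))
print(layers.concatenate([a, b], axis=1).shape)  # (1, 20)
```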
This model is more complicated than any of our earlier ones! Let’s see how it performs.
```python
#compile model
sce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

model.compile(optimizer='adam',
              loss=sce,
              metrics=['accuracy'])

#train model
history = model.fit(train, epochs=20, validation_data=val)
```
```
Epoch 1/20
180/180 [==============================] - 12s 66ms/step - loss: 0.6716 - accuracy: 0.5938 - val_loss: 0.6392 - val_accuracy: 0.6516
Epoch 2/20
180/180 [==============================] - 13s 70ms/step - loss: 0.5838 - accuracy: 0.8621 - val_loss: 0.5260 - val_accuracy: 0.8856
Epoch 3/20
180/180 [==============================] - 12s 68ms/step - loss: 0.4676 - accuracy: 0.9316 - val_loss: 0.4177 - val_accuracy: 0.9336
Epoch 4/20
180/180 [==============================] - 12s 65ms/step - loss: 0.3736 - accuracy: 0.9445 - val_loss: 0.3401 - val_accuracy: 0.9460
Epoch 5/20
180/180 [==============================] - 12s 66ms/step - loss: 0.3074 - accuracy: 0.9520 - val_loss: 0.2864 - val_accuracy: 0.9507
Epoch 6/20
180/180 [==============================] - 12s 66ms/step - loss: 0.2604 - accuracy: 0.9579 - val_loss: 0.2476 - val_accuracy: 0.9551
Epoch 7/20
180/180 [==============================] - 12s 68ms/step - loss: 0.2255 - accuracy: 0.9620 - val_loss: 0.2182 - val_accuracy: 0.9602
Epoch 8/20
180/180 [==============================] - 12s 65ms/step - loss: 0.1983 - accuracy: 0.9658 - val_loss: 0.1950 - val_accuracy: 0.9636
Epoch 9/20
180/180 [==============================] - 12s 66ms/step - loss: 0.1764 - accuracy: 0.9690 - val_loss: 0.1761 - val_accuracy: 0.9669
Epoch 10/20
180/180 [==============================] - 12s 66ms/step - loss: 0.1583 - accuracy: 0.9723 - val_loss: 0.1604 - val_accuracy: 0.9702
Epoch 11/20
180/180 [==============================] - 12s 66ms/step - loss: 0.1431 - accuracy: 0.9750 - val_loss: 0.1470 - val_accuracy: 0.9716
Epoch 12/20
180/180 [==============================] - 12s 65ms/step - loss: 0.1303 - accuracy: 0.9767 - val_loss: 0.1356 - val_accuracy: 0.9729
Epoch 13/20
180/180 [==============================] - 12s 67ms/step - loss: 0.1189 - accuracy: 0.9789 - val_loss: 0.1256 - val_accuracy: 0.9751
Epoch 14/20
180/180 [==============================] - 12s 66ms/step - loss: 0.1091 - accuracy: 0.9807 - val_loss: 0.1169 - val_accuracy: 0.9769
Epoch 15/20
180/180 [==============================] - 12s 66ms/step - loss: 0.1003 - accuracy: 0.9824 - val_loss: 0.1092 - val_accuracy: 0.9780
Epoch 16/20
180/180 [==============================] - 12s 66ms/step - loss: 0.0923 - accuracy: 0.9836 - val_loss: 0.1024 - val_accuracy: 0.9798
Epoch 17/20
180/180 [==============================] - 12s 66ms/step - loss: 0.0855 - accuracy: 0.9852 - val_loss: 0.0962 - val_accuracy: 0.9809
Epoch 18/20
180/180 [==============================] - 12s 67ms/step - loss: 0.0794 - accuracy: 0.9860 - val_loss: 0.0907 - val_accuracy: 0.9818
Epoch 19/20
180/180 [==============================] - 13s 74ms/step - loss: 0.0737 - accuracy: 0.9870 - val_loss: 0.0857 - val_accuracy: 0.9822
Epoch 20/20
180/180 [==============================] - 12s 69ms/step - loss: 0.0687 - accuracy: 0.9879 - val_loss: 0.0812 - val_accuracy: 0.9833
```
```python
main_plot = plot_accuracies(history, "Combined Model")
```
It looks like taking into account both the title and text of an article is best for determining whether or not it contains fake news. Just to be sure, let’s now test it on some testing data.
```python
#import testing data
test_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_test.csv?raw=true"
df = pd.read_csv(test_url)
test = make_dataset(df)
```
```python
#evaluate main model on testing data
model.evaluate(test)
```
```
225/225 [==============================] - 8s 35ms/step - loss: 0.0687 - accuracy: 0.9874
[0.06867755204439163, 0.9873936772346497]
```
So the model has 98.74% accuracy on the testing data.
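As a hypothetical follow-up (not part of the original assignment), here's how you'd run the model on a single article; the strings below are placeholders:

```python
# The model takes a dict keyed by the input names and returns softmax
# probabilities for the two classes (index 1 is "fake").
pred = model.predict({"title": tf.constant([["example headline"]]),
                      "text": tf.constant([["example article body"]])})
print(pred)
```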
Plotting Embeddings
When we put text through the `TextVectorization()` and `Embedding()` layers, we end up with a weight vector for each word in the vocabulary. Roughly speaking, an outlier means that the word has a significant impact on the outcome. We can visualize these weights by graphing them. However, the output of the embedding layers in our model is 10-dimensional, which isn't very convenient for plotting! We can use a `PCA` object from `sklearn.decomposition` to project these weights down to two dimensions, which we can then plot. Since we have two different vocabularies, we'll make two plots.
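Before applying PCA to the real weights, here's the shape bookkeeping on random data (purely illustrative, not from the model):

```python
# PCA projects each 10-dimensional embedding vector down to 2 components.
import numpy as np
rng = np.random.default_rng(0)
fake_weights = rng.normal(size=(2000, 10))  # same shape as our embedding matrices
print(PCA(n_components=2).fit_transform(fake_weights).shape)  # (2000, 2)
```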
```python
pca = PCA(n_components=2)

def plot_weights(vocab, layer_name):
    #get weights from embedding layer and transform to visualizable data
    weights = model.get_layer(layer_name).get_weights()[0]
    weights = pca.fit_transform(weights)

    #plot data
    df = pd.DataFrame({"word": vocab, "x_0": weights[:, 0], "x_1": weights[:, 1]})
    return px.scatter(df,
                      x = "x_0",
                      y = "x_1",
                      #size = [2]*len(df),
                      #size_max = 2,
                      hover_name = "word")

#get vocabulary lists from adapted TextVectorization layers
text_vocab = text_vectorize_layer.get_vocabulary()
title_vocab = title_vectorize_layer.get_vocabulary()
```
"titleEmbedding") plot_weights(title_vocab,
The display is finicky when moving between Google Colab, Jupyter Notebook, and Quarto, so I'm inserting the graphs as static figures. Unfortunately, they won't be interactive, but we'll do some analysis afterwards.
"textEmbedding") plot_weights(text_vocab,
In the plot of title weights, the leftmost point is "video," and "breaking" is an outlier in the same region. It's possible that these correspond to articles with less fake news, since breaking news stories generally concern big, undeniable events, and videos are often included in articles as evidence. In particular, if "video" is mentioned in an article's title, the contents are likely to be real footage.
Some outliers in the text weights plot include “trumps” and “obamas” on the rightmost side, whereas “obama” is on the leftmost side. Perhaps articles about politicians themselves tend to be more trustworthy, but news about their families (if we interpret “trumps” and “obamas” as such) can often contain speculation.